This repository was archived by the owner on Oct 11, 2024. It is now read-only.
[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes #186

Merged: varun-sundar-rabindranath merged 12 commits into `vllm-quantization` on Apr 16, 2024
Conversation
added 10 commits on April 12, 2024 at 20:30, including:

- … to remove changes to llama.py
vllm/model_executor/layers/linear.py (outdated)

```diff
  output_size_per_partition: int, input_size: int,
  output_size: int,
- params_dtype: torch.dtype) -> Dict[str, Any]:
+ params_dtype: torch.dtype, logical_widths: Optional[List[int]]) -> Dict[str, Any]:
```
Lift this to be inside `LinearMethodBase`?
Collaborator (Author):
I got rid of this in the next PR.
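To make the `logical_widths` argument concrete: for a fused QKV projection, the output dimension is the concatenation of the Q, K, and V widths, so per-shard metadata (e.g. quantization scales) can be allocated per logical shard rather than per fused tensor. A minimal sketch, with a hypothetical helper name (`create_shard_offsets` is not part of the actual vLLM API):

```python
# Hypothetical sketch of how `logical_widths` could describe a fused layer.
# Each logical shard gets a (start, end) offset into the fused output dim.

def create_shard_offsets(output_size_per_partition, logical_widths=None):
    """Return one (start, end) offset per logical shard."""
    if logical_widths is None:
        # Unfused layer: a single logical shard spans the whole output.
        logical_widths = [output_size_per_partition]
    assert sum(logical_widths) == output_size_per_partition
    offsets = []
    start = 0
    for width in logical_widths:
        offsets.append((start, start + width))
        start += width
    return offsets

# A fused QKV layer with width 8 for Q and 4 each for K and V:
print(create_shard_offsets(16, [8, 4, 4]))  # [(0, 8), (8, 12), (12, 16)]
```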
```diff
@@ -1,8 +1,9 @@
 from typing import Any, Dict, List, Tuple, Optional
```
vllm/model_executor/model_loader.py (outdated)

```diff
  if _is_support_smoothquant(model_config):
-     model = model_class(model_config.hf_config, linear_method,
-                         quant_config)
+     model = model_class(model_config.hf_config, linear_method)
```
How come we don't have to pass in the `quant_config`? Because the `LinearMethod` already knows if it is quantized?
Collaborator (Author):
Yeah, the linear method handles it.
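The point being made here can be sketched in a few lines: the quantization config is captured by the linear method at construction time, so the model constructor only needs the linear method itself. Class and function names below are illustrative, not the actual vLLM API:

```python
# Hypothetical sketch: the config travels inside the linear method,
# so the model never needs a separate quant_config argument.

class QuantConfig:
    def __init__(self, per_token):
        self.per_token = per_token

class QuantLinearMethod:
    def __init__(self, quant_config):
        # The config is captured here, once, during get_model.
        self.quant_config = quant_config

def build_model(linear_method):
    # The model constructor only sees the linear method.
    return {"linear_method": linear_method}

cfg = QuantConfig(per_token=True)
model = build_model(QuantLinearMethod(cfg))
print(model["linear_method"].quant_config.per_token)  # True
```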
LGTM.
varun-sundar-rabindranath approved these changes on Apr 16, 2024
… via config (#188)

Refactored to support nonuniform quantization by adding a new layer of abstraction. Now, `SmoothQuantLinearMethod` can hold a `SmoothQuantFormat`, which implements the details of how to do quant and dequant operations. There are two `SmoothQuantFormat` classes:

- `SmoothQuantDynamicPerToken`
- `SmoothQuantStaticPerTensor`

We have the following lifecycle:

- `LinearMethod` is created during `get_model` and has access to `QuantizationConfig`
- `Layer` is initialized and passed a `LinearMethod`
- `Layer` calls `LinearMethod.create_weights`, which creates a dictionary of weights and metadata
- `Layer` calls `LinearMethod.apply_weights` during inference, passing the dictionary created during `create_weights`

This PR modifies the `LinearMethod.create_weights` API to receive a `layer_name` as an argument. The `LinearMethod` then looks in the `config` to determine which `SmoothQuantFormat` to use for the layer with `layer_name`. As a result, the `LinearMethod` is responsible for parsing the config from disk and making decisions about what the inference format should look like. In this specific case, since the `SmoothQuantConfig` is not very good, we just match on the suffix `qkv` to determine what each layer should use, but for `SparseMLConfig` we could use a similar structure.

In this PR, the `SmoothQuantFormat` is passed in the dictionary returned by `create_weights` and then is used by `apply_weights`.

### In Summary

I think this is a good overall structure because it:

- (a) allows us to make minimal changes to the existing models
- (b) allows us to make no changes to the model loading lifecycle (i.e. config / constructor / linear method); this critically requires having one `LinearMethod` that propagates through the whole model
- (c) encapsulates the nonuniform logic into the `LinearMethod`, allowing us to have a clean interface

### For SparseML Models

We could imagine the following architecture:

#### Config

Config is responsible for:

- loading the config from disk
- mapping layer names to `SparseMLFormat`s

```python
class SparseMLConfig:
    def from_dict(self):
        ...

    def get_layer_format(self, layer_name):
        return SparseMLFormat
```

#### LinearMethod

LinearMethod is responsible for:

- the interface between layers and kernels (so `LinearMethod` is what is used by the model)

```python
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, ...):
        # This, e.g., is where nonuniform quantization might be supported.
        format = self.sparseml_config.get_layer_format(layer_name)
        weights = format.get_weights()
        weights["format"] = format
        return weights

    # Wrapper around the SparseML format.
    def apply_weights(self, x, weights, ...):
        format = weights["format"]
        weights = weights["weights"]
        return format.apply_weights(x, weights)
```

#### SparseMLFormat

Format is responsible for:

- actual weight creation and the forward pass

```python
class SparseMLFormat:
    def get_weights(self, sizes):
        # Returns a dictionary, e.g.:
        return {
            "weights": x,
            "scales": y,
        }

    def apply_weights(self, weights, x):
        # Calls the CUDA kernel.
        return output
```

Sample formats:

- `W8A8DynamicPerToken`
- `SparseW8A8StaticPerTensorAsymmetric`
- `W4A8DynamicPerToken`
- ...
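The lifecycle described above can be condensed into a runnable sketch. Class names mirror the description but are heavily simplified, and which format maps to the `qkv` suffix is an assumption for illustration; the real `create_weights`/`apply_weights` operate on tensors and call quantized kernels:

```python
# Minimal sketch of layer_name -> SmoothQuantFormat dispatch.

class SmoothQuantDynamicPerToken:
    name = "dynamic_per_token"

class SmoothQuantStaticPerTensor:
    name = "static_per_tensor"

class SmoothQuantConfig:
    def get_layer_format(self, layer_name):
        # Match on the `qkv` suffix, as described in the PR text.
        # (Which format goes with which suffix is assumed here.)
        if layer_name.endswith("qkv"):
            return SmoothQuantStaticPerTensor()
        return SmoothQuantDynamicPerToken()

class SmoothQuantLinearMethod:
    def __init__(self, config):
        self.config = config  # parsed from disk in the real code

    def create_weights(self, layer_name):
        # The format rides along in the returned weights dict.
        fmt = self.config.get_layer_format(layer_name)
        return {"weights": None, "format": fmt}

    def apply_weights(self, weights, x):
        # Stand-in for the real kernel dispatch via the format object.
        return weights["format"].name

method = SmoothQuantLinearMethod(SmoothQuantConfig())
qkv_weights = method.create_weights("model.layers.0.qkv")
print(method.apply_weights(qkv_weights, x=None))  # static_per_tensor
```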
Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid making changes to `llama.py`, by making the indexing (splitting QKV into logical shards) generic and explicitly handling `state_dict` conversion.

Many todos left, including:

- `use_per_token`: need to use the quant config for this
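For context on what `use_per_token` selects between, here is a hedged pure-Python sketch of the two scale schemes named above (`SmoothQuantDynamicPerToken` vs. `SmoothQuantStaticPerTensor`); the function names are illustrative, and real implementations compute these over tensors:

```python
# Dynamic per-token quantization derives one scale per row (token) of the
# activations at runtime; static per-tensor uses a single precomputed scale.

def per_token_scales(x, qmax=127):
    # One scale per row, from that row's absolute maximum (int8 range).
    return [max(abs(v) for v in row) / qmax for row in x]

def per_tensor_scale(x, qmax=127):
    # A single scale for the whole tensor.
    return max(abs(v) for row in x for v in row) / qmax

acts = [[1.0, -2.0], [0.5, 0.25]]
print(per_token_scales(acts))  # [2.0/127, 0.5/127]
print(per_tensor_scale(acts))  # 2.0/127
```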